Abstract


Distantly-Supervised Named Entity Recogni- 
tion (DS-NER) effectively alleviates the burden 
of annotation, but meanwhile suffers from
label noise. Recent works attempt to adopt the
teacher-student framework to gradually refine 
the training labels and improve the overall ro- 
bustness. However, we argue that these teacher- 
student methods achieve limited performance 
because the poor calibration of the teacher net- 
work produces incorrectly pseudo-labeled sam- 
ples, leading to error propagation. Therefore, 
we attempt to mitigate this issue by propos- 
ing: (1) Uncertainty-Aware Teacher Learning 
that leverages the prediction uncertainty to re- 
duce the number of incorrect pseudo labels in 
the self-training stage; (2) Student-Student Col- 
laborative Learning that allows the transfer of 
reliable labels between two student networks in- 
stead of indiscriminately relying on all pseudo 
labels from its teacher, and further enables a 
full exploration of mislabeled samples rather 
than simply filtering unreliable pseudo-labeled 
samples. We evaluate our proposed method 
on five DS-NER datasets, demonstrating that 
our method is superior to the state-of-the-art 
DS-NER denoising methods. 


1 Introduction 


Named Entity Recognition (NER) aims to detect 
entity spans in text and then classify them into pre- 
defined categories, which plays an important role in 
many applications such as dialogue systems (Li and 
Zhao, 2023; Liu et al., 2023; Si et al., 2022a, 2024). 
However, deep learning-based NER methods usu- 
ally require a substantial quantity of high-quality 
annotation for training models, which is not only 
exceedingly costly but also time-consuming. 

To alleviate the burden of annotation, Distantly- 
Supervised Named Entity Recognition (DS-NER) 
Figure 1: A sample generated by DS-NER. “Amazon”
and “Washington” are inaccurate annotations. “Arafat”
and “rainforest” are incomplete annotations.


is widely used in real-world scenarios. It can au- 
tomatically generate massive labeled training data 
by matching entities in existing knowledge bases 
with snippets in text. However, DS-NER suffers 
from two inherent issues: (1) Inaccurate Annota- 
tion: due to the context-free matching, the entity 
with multiple types in the knowledge bases may be 
labeled as an inaccurate type, and (2) Incomplete 
Annotation: due to the limited coverage of knowl- 
edge bases, many entity mentions in text cannot be 
matched and are wrongly labeled as non-entity. As 
shown in Figure 1, the entity types of "Washing- 
ton" and "Amazon" are wrongly labeled owing to 
context-free matching, and "Arafat" is not recog- 
nized due to the limited coverage of resources. 
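The two issues can be illustrated with a minimal sketch of context-free matching. The knowledge base entries, the sentence, and the helper `distant_labels` are hypothetical and chosen only for illustration; they are not taken from Figure 1 or from any real DS-NER pipeline, and the sketch assigns one type per token rather than BIO tags.

```python
# Toy knowledge base: surface form -> a single entity type (hypothetical entries).
KB = {"Washington": "LOC", "Amazon": "ORG"}


def distant_labels(tokens, kb):
    """Context-free matching: a token found in the KB always gets the KB type."""
    return [kb.get(token, "O") for token in tokens]


tokens = ["Washington", "praised", "Arafat", "near", "the", "Amazon"]
labels = distant_labels(tokens, KB)

# "Washington" is a person here but receives LOC, and "Amazon" refers to a
# place but receives ORG (inaccurate annotation).
# "Arafat" is missing from the KB and is labeled as non-entity (incomplete annotation).
for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```

Because the lookup ignores context, no amount of extra text fixes either error; only a larger or type-aware knowledge base, or a denoising method, can.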


Therefore, many works attempt to address these 
issues (Peng et al., 2019; Zhou et al., 2022; Li et al.,
2021; Si et al., 2022b, 2023). Recently, the self- 
training teacher-student framework in DS-NER has 
attracted increasing attention (Liang et al., 2020; 
Zhang et al., 2021a; Qu et al., 2023), as it can 
handle inaccurate and incomplete labels simulta- 
neously, and use generated pseudo labels to make 
full use of the mislabeled samples from DS-NER 
dataset. This self-training framework first uses
generated reliable pseudo labels from the teacher 
network to train the student network, and then up- 
dates a new teacher by shifting the weights of the 
trained student. Through this self-training loop, the 


training labels are gradually refined and model gen- 
eralization can be improved. Specifically, BOND 
(Liang et al., 2020) designs a teacher-student net- 
work and selects high-confidence pseudo labels as 
reliable labels to get a more robust model. SCDL 
(Zhang et al., 2021b) further improves the perfor- 
mance by jointly training two teacher-student net- 
works, then selects consistent and high-confidence 
pseudo labels between two teachers as reliable labels.
ATSEN (Qu et al., 2023) designs two teacher-student networks
by considering both consistent
and inconsistent high-confidence pseudo labels be- 
tween two teachers and also proposes fine-grained 
teacher updating to achieve advanced performance. 

The above teacher-student methods highly rely 
on using the high-confidence pseudo labels (e.g., 
pseudo labels with confidence values greater than 
0.7) as reliable labels, as they assume that the 
teacher model’s predictions with high confidence 
tend to be correct. However, this assumption may 
be far from reality. Neural networks are usually 
poorly calibrated (Guo et al., 2017; Rizve et al., 
2021), i.e., the probability associated with the pre- 
dicted label usually reflects the bias of the teacher 
network and does not reflect the likelihood of its 
ground truth correctness. Therefore, a poorly cal- 
ibrated teacher network can easily generate incor- 
rect pseudo labels with high confidence. We argue 
that previous teacher-student methods achieve lim- 
ited performance because poor network calibration 
produces incorrect pseudo-labeled samples, lead- 
ing to error propagation. 

We aim to reduce the effect of incorrect pseudo 
labels within the teacher-student framework by 
unCertainty-aware tEacher aNd Student-Student 
cOllaborative leaRning (CENSOR). Specifically, 
we apply two teacher-student networks to provide 
multi-view predictions on training samples. We 
propose Uncertainty-aware Teacher Learning that 
leverages the prediction uncertainty to guide the 
selection procedure of pseudo labels. Then, we use 
both uncertainty and confidence as indicators to se- 
lect pseudo labels, reducing the number of incorrect 
pseudo labels selected by confidence scores from 
poorly calibrated teacher networks. We only select 
the pseudo labels with high confidence and low 
uncertainty as reliable labels, since these selected 
labels are more likely to contain less noise. Subse- 
quently, to further reduce the risk of learning incor- 
rect pseudo labels and make a full exploration of 
mislabeled samples, we introduce Student-Student 
Collaborative Learning that allows the transfer of 


reliable labels between two student networks. In 
each batch of data, each student network views its 
small-loss pseudo labels (e.g., pseudo labels of 10% 
samples with the smallest loss) as reliable labels 
and then teaches such reliable labels to the other stu- 
dent network for updating the parameters. In this 
way, a student network does not completely rely 
on all the pseudo labels from its poorly calibrated 
teacher network. Meanwhile, different from just fil- 
tering unreliable pseudo-labeled samples, this com- 
ponent provides the opportunity for the incorrect 
pseudo-labeled samples to be correctly labeled by 
the other teacher-student network, allowing the full 
exploration of training data. Experiments demon- 
strate that our method significantly outperforms 
previous methods, e.g., improving the F1 score by 
an average of 1.87% on five DS-NER datasets. 


2 Related Work 


To alleviate the burden of annotation, previous stud- 
ies attempted to annotate NER datasets via distant 
supervision, which suffers from noisy annotation. 


DS-NER Methods To address these issues, vari- 
ous methods have been proposed. Several studies 
(Shang et al., 2018; Yang et al., 2018; Jie et al., 
2019) modify CRF to get better performance under 
the noise. Peng et al. (2019); Zhou et al. (2022) 
try to employ PU learning to obtain the unbiased 
estimation of loss value. Li et al. (2021, 2022) intro- 
duce negative sampling to mitigate the misguidance 
from unlabeled entities. Liang et al. (2020); Zhang 
et al. (2021b); Qu et al. (2023) adopt the teacher- 
student framework to handle both inaccurate and 
incomplete labels simultaneously. In this paper, 
we attempt to reduce the effect of incorrect pseudo 
labels and error propagation in the teacher-student 
framework to achieve better performance. 


Teacher-Student Framework The teacher-student
framework is a popular architecture in many semi-supervised
tasks (Huo et al., 2021). Recently, the
teacher-student framework has attracted increasing
attention in the DS-NER task. BOND (Liang et al.,
2020) first applies self-training with a
teacher-student network in DS-NER. SCDL (Zhang 
et al., 2021b) further improves the performance by 
jointly training two teacher-student networks. AT- 
SEN (Qu et al., 2023) considers both consistent and 
inconsistent predictions between two teachers and 
proposes fine-grained teacher updating to achieve 
more robustness. We improve the teacher-student 
Figure 2: General architecture of CENSOR, which consists of two teacher-student networks. (1) means the teacher
network first generates pseudo labels. (2) means estimating the confidence and uncertainty of generated pseudo
labels. (3) means selecting reliable pseudo labels according to confidence and uncertainty, where masked pseudo
labels will not be used to update the student network. (4) means using Student-Student Collaborative Learning to
transfer the reliable pseudo labels. (5) means using selected reliable pseudo labels to update the corresponding
student network. (6) means updating a new teacher by shifting the weights of the trained student.


framework by Uncertainty-Aware Teacher Learn- 
ing and Student-Student Collaborative Learning, 
jointly reducing the effect of incorrect pseudo la- 
bels. In this way, our method can avoid error prop- 
agation and achieve better overall performance. 


3 Task Definition 


Given the training corpus $D_{ds}$, each sample is a pair
$(x_i, y_i)$, where $x_i$ represents the i-th token and $y_i$ is its label.
Each entity is a span of the text, associated with an 
entity type. We use the BIO scheme for sequence la- 
beling. The beginning token of an entity is labeled 
as B-type, and others are I-type. The non-entity 
tokens are labeled as O. Traditional NER is a su- 
pervised learning task based on a clean dataset. We 
focus on the practical scenario where the training 
labels are noisy due to distant supervision, i.e., the 
revealed tag $y_i$ may not correspond to the underlying
correct one. Thus, the challenge of DS-NER is
to reduce the negative effect of noisy annotations. 
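As a small illustration of the BIO scheme described above, the following sketch converts entity spans into token-level tags. The helper name `spans_to_bio`, the span format (start, exclusive end, type), and the example sentence are assumptions made for this example only.

```python
def spans_to_bio(tokens, spans):
    """Convert non-overlapping entity spans to BIO tags.

    spans: list of (start, end, entity_type), with `end` exclusive.
    """
    tags = ["O"] * len(tokens)            # non-entity tokens are labeled O
    for start, end, entity_type in spans:
        tags[start] = f"B-{entity_type}"  # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{entity_type}"  # remaining tokens of the entity
    return tags


tokens = ["John", "McNamara", "visited", "New", "York"]
spans = [(0, 2, "PER"), (3, 5, "LOC")]
bio_tags = spans_to_bio(tokens, spans)
print(list(zip(tokens, bio_tags)))
```

In the distantly supervised setting, the spans come from knowledge-base matching, so some of the resulting tags are wrong or missing.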


4 Methodology 


As shown in Figure 2, CENSOR consists of two 
teacher-student networks to handle the noisy label. 
To avoid overfitting the incorrect pseudo labels gen- 
erated by poorly calibrated teacher networks, we in- 
troduce Uncertainty-Aware Teacher Learning that 
leverages the prediction uncertainty to guide the 
label selection. We also propose Student-Student 
Collaborative Learning that allows reliable label 


transfer between two student networks, further re- 
ducing the risk of learning incorrect pseudo labels 
and making full use of mislabeled samples.


4.1 Teacher-student Framework 


Neural networks excel at memorization (Arpit et al., 
2017). However, when noisy labels become promi- 
nent, deep-learning-based NER models inevitably 
overfit noisy labeled data, resulting in poor perfor- 
mance. The purpose of the teacher-student methods 
is to select reliable labels (i.e., pseudo labels that 
are more likely to be labeled correctly), to reduce 
the negative effect of label noise. Self-training 
involves the teacher-student network, where the 
teacher network first generates pseudo labels to 
participate in label selection. Then the student is 
optimized via back-propagation based on selected 
reliable labels, and the teacher is updated by grad- 
ually shifting the weights of the student with an 
exponential moving average (EMA). Following Qu 
et al. (2023), we train two sets of teacher-student 
networks using two different NER models to pro- 
vide multi-view predictions on training samples. 


4.2 Uncertainty-Aware Teacher Learning 


In the DS-NER task, one of the main challenges 
of the teacher-student framework is to evaluate the 
correctness of the generated pseudo labels of the 
teacher model. Previous methods (Liang et al., 
2020; Zhang et al., 2021a; Qu et al., 2023) gener- 
ally assume that high-confidence predictions tend 


to be correct. Therefore, they select the samples 
with high-confidence pseudo labels (e.g., pseudo 
labels with confidence values greater than 0.7) as 
training data. However, the teacher network is 
prone to generating high-confidence yet incorrect 
pseudo labels due to the poor calibration (Guo et al., 
2017). This overconfidence is indicative of model 
bias rather than the true likelihood of correctness. 
Therefore, relying solely on the teacher network’s 
confidence as the indicator may not efficiently eval- 
uate the correctness of the pseudo labels. 
Meanwhile, we observe that when the NER 
model performs supervised learning on a misla- 
beled token, it receives two types of supervision 
from the incorrect label of the mislabeled token 
and the labels of semantically similar but correctly 
labeled tokens. For example, “Washington” in Figure 1
is mislabeled as “LOC” (location), and the
model trained with it tends to predict “Washington”
as “LOC” instead of “PER” (person). The
model is also exposed to semantically similar but
correctly labeled tokens, such as the token “James”
labeled as “PER” in the training sentence “U.S.
President will meet James at the White House”,
and thus the model may also learn to generalize “Washington”
as “PER”. The knowledge in both types
of supervision is eventually learned and saved to 
the network neurons. However, as the training con- 
tinues, the deep-learning-based model inevitably 
overfits the noisy labels due to its memorization 
capability (Arpit et al., 2017), rather than utilizing 
the correct knowledge learned from the labels of 
semantically similar but correctly labeled tokens. 


Uncertainty Estimation Based on our observa- 
tion, we find that randomly deactivating neurons 
introduces variability in predicted confidence of the 
incorrect pseudo label, which can be attributed to 
varying subsets of active neurons influencing each 
prediction. Specifically, the randomness of deacti- 
vation of the network neurons makes the remaining 
network neurons sometimes retain more knowl- 
edge learned from the incorrect label of the misla- 
beled token, and sometimes retain more knowledge 
learned from the labels of semantically similar but 
correctly labeled tokens. Consequently, such dis- 
crepancies can lead to inconsistencies in multiple 
predictions. For the correctly labeled tokens, since 
their labels are the same as those of semantically 
similar tokens, the two types of knowledge stored 
in the network neurons are more consistent, so the 
predictions from the different subsets of active neu- 


rons tend to be more consistent. Thus, we define the 
inconsistency of predictions from sampled teacher 
network neurons as uncertainty and evaluate the 
correctness of the generated pseudo labels. 

Specifically, given the new input token $x^*$ and
the pseudo label $\hat{y}^*$ generated by the teacher network
$W$, we perform $K$ forward passes with
Dropout (Krizhevsky et al., 2012) through our
teacher networks at inference time. In each pass,
pre-defined parts of network neurons are randomly
deactivated. This yields $K$ subsets of
active neurons $\{W_1, W_2, \ldots, W_K\}$. To estimate the
uncertainty for each token in the sequence labeling
task, we leverage the variance of the model outputs
for each token from multiple forward passes:


$$s_{un}(y^* = \hat{y}^* \mid W, x^*) = \mathrm{Var}\left[p(y^* = \hat{y}^* \mid W_k, x^*)\right]_{k=1}^{K}, \quad (1)$$


where $\mathrm{Var}[\cdot]$ is the variance of the distribution over
the $K$ passes through the teacher network. A
lower uncertainty indicates that the predictions from
sampled teacher network neurons and the learned
knowledge are more consistent, and thus the pseudo
label is more likely to be correct.
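The estimate in Eq. 1 can be sketched with a toy model. The following is a minimal illustration, assuming a tiny linear softmax classifier with dropout applied to its input features; the names `forward` and `confidence_and_uncertainty`, the weights, and the default values of `k` and `drop_rate` are all hypothetical. The actual teacher is a pre-trained NER model, and the population variance used here is one possible choice of estimator.

```python
import math
import random
from statistics import pvariance


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(weights, features, drop_rate=0.0, rng=None):
    """One pass of a toy linear classifier; dropout randomly zeroes input features."""
    if drop_rate > 0.0:
        keep = 1.0 - drop_rate
        # Inverted dropout: surviving features are rescaled by 1 / keep.
        features = [f / keep if rng.random() < keep else 0.0 for f in features]
    logits = [sum(w * f for w, f in zip(row, features)) for row in weights]
    return softmax(logits)


def confidence_and_uncertainty(weights, features, k=20, drop_rate=0.3, seed=0):
    """Return (pseudo label, confidence, variance over k stochastic passes)."""
    rng = random.Random(seed)
    probs = forward(weights, features)                      # deterministic pass
    pseudo = max(range(len(probs)), key=probs.__getitem__)  # argmax class
    samples = [forward(weights, features, drop_rate, rng)[pseudo] for _ in range(k)]
    return pseudo, probs[pseudo], pvariance(samples)


weights = [[1.5, -0.5, 0.2], [-1.0, 0.8, 0.1]]  # 2 classes x 3 features (toy values)
features = [1.0, 0.5, 2.0]
pseudo, conf, unc = confidence_and_uncertainty(weights, features)
print(pseudo, round(conf, 3), unc)
```

With `drop_rate=0.0` every pass is identical and the variance is zero, which matches the intuition that uncertainty here measures disagreement between sampled sub-networks.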


Uncertainty-Aware Label Selection Unlike
previous teacher-student methods, which use only
confidence as the indicator to select reliable pseudo
labels, we jointly consider confidence and uncertainty
in label selection. The pseudo label $\hat{y}^*$ and its
confidence are defined as follows:


$$\hat{y}^* = \arg\max_{y^*} p(y^* \mid W, x^*)$$

$$s_{co}(y^* = \hat{y}^* \mid W, x^*) = p(y^* = \hat{y}^* \mid W, x^*) \quad (2)$$


A higher confidence value $s_{co}$ means the model
is more confident in the pseudo label $\hat{y}^*$. However,
many of the pseudo labels selected with
high confidence are also incorrect due to the poorly
calibrated teacher network (Guo et al., 2017), leading
to error propagation in the self-training. To
reduce the effect of incorrect pseudo labels, we
additionally use the uncertainty score $s_{un}$ as an
indicator. Specifically, we select a subset of pseudo
labels which are both high-confidence and low-uncertainty
as reliable labels, since jointly considering
confidence and uncertainty can further filter
the incorrect pseudo labels with high confidence.
Thus, we define a mask matrix, i.e.,


$$M^* = \begin{cases} 1, & s_{un} < \sigma_{ua} \text{ and } s_{co} > \sigma_{co}; \\ 0, & \text{otherwise.} \end{cases} \quad (3)$$


When $M^* = 0$, the pseudo label may be
incorrect and the sample should be masked in the
self-training. $\sigma_{co}$ and $\sigma_{ua}$ are hyperparameters.
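The selection rule of Eq. 3 reduces to a per-token comparison against two thresholds. The sketch below is only illustrative: the function name `reliability_mask` is hypothetical, the confidence threshold of 0.7 echoes the example value mentioned earlier in the text, and the uncertainty threshold of 0.05 is an arbitrary placeholder rather than a value from the paper.

```python
def reliability_mask(confidences, uncertainties, sigma_co=0.7, sigma_ua=0.05):
    """Return 1 for tokens whose pseudo label is kept, 0 for masked tokens.

    A pseudo label is kept only if it is both high-confidence and low-uncertainty.
    """
    return [
        1 if (s_co > sigma_co and s_un < sigma_ua) else 0
        for s_co, s_un in zip(confidences, uncertainties)
    ]


confidences = [0.95, 0.92, 0.60, 0.40]
uncertainties = [0.01, 0.20, 0.01, 0.30]
mask = reliability_mask(confidences, uncertainties)
print(mask)
```

The second token shows the case the method targets: a confidence-only filter would keep it, while the joint rule masks it because its predictions vary strongly across dropout passes.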


4.3 Student-Student Collaborative Learning 


Based on Uncertainty-Aware Teacher Learning, the 
teacher network can utilize the correctly pseudo- 
labeled samples to alleviate the negative effect of 
label noise. However, simply masking unreliable 
pseudo-labeled samples can lead to underutiliza- 
tion of the training set, as there is no chance for the 
incorrect pseudo-labeled samples to be corrected 
and further learned. Intuitively, if we can correct 
the incorrect pseudo label with the correct one, 
it will become a useful training sample. There- 
fore, to address these shortcomings and incorpo- 
rate Uncertainty-Aware Teacher Learning to make 
the teacher-student network more effective, we pro- 
pose Student-Student Collaborative Learning. 


The idea of Student-Student Collaborative Learn- 
ing is to utilize two different student networks and 
let them learn from each other. We regard small-loss
samples as clean samples for training: in each
batch of data, each student network views its small-loss
pseudo labels (e.g., pseudo labels of the 10% of samples
with the smallest loss) as the reliable labels,
and transfers such reliable labels to the other student
network for updating the parameters. These
small-loss samples are far from the decision bound- 
aries of the two models and thus are more likely 
to be true positives and true negatives (Feng et al., 
2019). In this way, a student network does not have to
rely completely on all pseudo labels from the
teacher network, further reducing the risk of learn- 
ing incorrect pseudo labels generated by the poorly 
calibrated teacher network. Moreover, the two dif- 
ferent student networks may have different deci- 
sion boundaries and thus are good at recognizing 
different patterns in data. Different from simply 
masking unreliable pseudo-labeled samples, this 
component also provides the opportunity for the 
incorrect pseudo-labeled samples to be correctly la- 
beled by the other teacher-student network to make 
full use of the training data. 


Specifically, for two student networks $s_1$, $s_2$ and
their parameters $W_{s_1}$, $W_{s_2}$, we first let $s_1$ (resp.
$s_2$) select a small ratio of samples in the current batch
of data $D$ that have small training loss. For these
selected samples $D_{s_1}$ (resp. $D_{s_2}$) from $s_1$ (resp.
$s_2$), we use the corresponding generated pseudo
labels $\hat{y}_{s_1}$ (resp. $\hat{y}_{s_2}$) as reliable labels and transfer
such reliable labels to the other student network
$s_2$ (resp. $s_1$) for updating the parameters $W_{s_2}$ (resp.
$W_{s_1}$). The ratio of transferred labels is controlled
by the hyperparameter $\delta$. In this way, two student


networks can learn from each other’s reliable labels, 
reducing the risk of learning from incorrect pseudo 
labels and making full use of the training data. 
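The label exchange described above can be sketched as follows. This is a simplified illustration over per-sample losses in one batch; the helper names, the dictionary representation of transferred labels, and the toy values are assumptions, and the `ratio` argument plays the role of the transfer-ratio hyperparameter.

```python
def small_loss_indices(losses, ratio):
    """Indices of the `ratio` fraction of samples with the smallest loss (at least one)."""
    n_keep = max(1, int(len(losses) * ratio))
    return sorted(range(len(losses)), key=losses.__getitem__)[:n_keep]


def exchange_reliable_labels(losses_1, pseudo_1, losses_2, pseudo_2, ratio=0.1):
    """Each student hands the pseudo labels of its small-loss samples to the other."""
    to_student_2 = {i: pseudo_1[i] for i in small_loss_indices(losses_1, ratio)}
    to_student_1 = {i: pseudo_2[i] for i in small_loss_indices(losses_2, ratio)}
    return to_student_1, to_student_2


# Toy batch of four samples with per-sample losses and pseudo labels per student.
losses_1, pseudo_1 = [0.9, 0.1, 0.5, 0.05], ["O", "B-PER", "O", "B-LOC"]
losses_2, pseudo_2 = [0.02, 0.8, 0.3, 0.6], ["O", "B-PER", "B-ORG", "B-LOC"]

to_student_1, to_student_2 = exchange_reliable_labels(
    losses_1, pseudo_1, losses_2, pseudo_2, ratio=0.5
)
print(to_student_1, to_student_2)
```

Note that sample 2 is labeled "O" by the first student but arrives at that student as "B-ORG" from its peer, which is the mechanism that gives a masked or mislabeled sample a chance to be relabeled rather than discarded.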


4.4 Training and Inference 


Algorithm 1 in Appendix A.3 gives the pseudocode. 
The process can be divided into three stages: the 
pre-training, the self-training, and the inference. 


Pre-Training Stage We warm up two different
NER models $W_A$ and $W_B$ on the noisy DS-NER
dataset to obtain a better initialization, and then
duplicate the parameters $W$ for both the teacher $W_t$
and the student $W_s$ (i.e., $W_{t_1} = W_{s_1} = W_A$, $W_{t_2} = W_{s_2} = W_B$).
The training objective function is the
cross entropy loss with the following form:


$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} y_i \log(p(y_i \mid W_s, x_i)) \quad (4)$$


where $y_i$ means the label of the i-th token
$x_i$ in the DS-NER corpus $D_{ds}$, and $p(y_i \mid W_s, x_i)$
denotes its probability produced by student network
$W_s$. $N$ is the size of the training corpus.


Self-Training Stage In this stage, we select reliable
pseudo-labeled tokens to train the two teacher-student
networks respectively. Specifically, we select
reliable labels generated by teachers $W_t$ and
supervise the students $W_s$ with cross-entropy loss.
During the label selection, we use the proposed
Uncertainty-Aware Label Selection to jointly consider
the confidence and uncertainty as shown in
Eq. 3 to reduce the effect of incorrect pseudo-labeled
samples. Meanwhile, we use Student-Student
Collaborative Learning to allow the student
networks to learn from each other's reliable labels
by selecting the pseudo labels from small-loss
samples. Therefore, the training objective function
of student networks $W_s$ in this stage is the cross
entropy loss with the following form:


$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} M_i \hat{y}_i \log(p(\hat{y}_i \mid W_s, x_i)) \quad (5)$$


where $\hat{y}_i$ means the i-th pseudo label generated
by Student-Student Collaborative Learning and its
teacher $W_t$. $p(\hat{y}_i \mid W_s, x_i)$ denotes the probability
produced by student network $W_s$ on the generated
pseudo label. $M_i$ is the indicator of whether the i-th token
$x_i$ should be masked according to Eq. 3. Meanwhile,
if $\hat{y}_i$ is the transferred pseudo label from
the other student, $M_i$ will be automatically set to
1 (unmasked). That is, we are more inclined to


| Method | CoNLL03 P | CoNLL03 R | CoNLL03 F1 | OntoNotes5.0 P | OntoNotes5.0 R | OntoNotes5.0 F1 | Webpage P | Webpage R | Webpage F1 | Wikigold P | Wikigold R | Wikigold F1 | Twitter P | Twitter R | Twitter F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| KB-Matching | 81.13 | 63.75 | 71.40 | 63.86 | 55.71 | 59.51 | 62.59 | 45.14 | 52.45 | 47.90 | 47.63 | 47.76 | 40.34 | 32.22 | 35.83 |
| BiLSTM-CRF | 75.50 | 49.10 | 59.50 | 68.44 | 64.50 | 66.41 | 58.05 | 34.59 | 43.34 | 47.55 | 39.11 | 42.92 | 46.91 | 14.18 | 21.77 |
| DistilRoBERTa | 77.87 | 69.91 | 73.68 | 66.83 | 68.81 | 67.80 | 56.05 | 59.46 | 57.70 | 48.85 | 52.05 | 50.40 | 45.72 | 43.85 | 44.77 |
| RoBERTa | 82.29 | 70.47 | 75.93 | 66.99 | 69.51 | 68.23 | 59.24 | 62.84 | 60.98 | 47.67 | 58.59 | 52.57 | 50.97 | 42.66 | 46.45 |
| AutoNER | 75.21 | 60.40 | 67.00 | 64.63 | 69.95 | 67.18 | 48.82 | 54.23 | 51.39 | 43.54 | 52.35 | 47.54 | 43.26 | 18.69 | 26.10 |
| LRNT | 79.91 | 61.87 | 69.74 | 67.36 | 68.02 | 67.69 | 46.70 | 48.83 | 47.74 | 45.60 | 46.84 | 46.21 | 46.94 | 15.98 | 23.84 |
| Co-teaching+ | 86.04 | 68.74 | 76.42 | 66.63 | 69.32 | 67.95 | 61.65 | 55.41 | 58.36 | 55.23 | 49.26 | 52.08 | 51.67 | 42.66 | 46.73 |
| JoCoR | 83.65 | 69.69 | 76.04 | 66.74 | 68.74 | 67.73 | 62.14 | 58.78 | 60.42 | 51.48 | 51.23 | 51.35 | 49.40 | 45.59 | 47.42 |
| NegSampling | 80.17 | 77.72 | 78.93 | 64.59 | 72.39 | 68.26 | 70.16 | 58.78 | 63.97 | 49.49 | 55.35 | 52.26 | 50.25 | 44.95 | 47.45 |
| BOND | 82.05 | 80.92 | 81.48 | 67.14 | 69.61 | 68.35 | 67.37 | 64.19 | 65.74 | 53.44 | 68.58 | 60.07 | 53.16 | 43.76 | 48.01 |
| SCDL | 87.96 | 79.82 | 83.69 | 67.49 | 69.77 | 68.61 | 68.71 | 68.24 | 68.47 | 62.25 | 66.12 | 64.13 | 59.87 | 44.57 | 51.09 |
| ATSEN | 85.75 | 83.86 | 84.79 | 65.69 | 70.71 | 68.11 | 71.08 | 70.03 | 70.55 | 57.67 | 54.71 | 56.15 | 59.31 | 45.83 | 51.71 |
| CENSOR | 87.33 | 85.90 | 86.61 | 67.11 | 71.01 | 69.01 | 75.89 | 72.30 | 74.05 | 66.01 | 68.10 | 67.05 | 58.63 | 47.38 | 52.41 |


Table 1: Main results on five DS-NER datasets. We report the baseline results from Liang et al. (2020); Zhang et al. 
(2021a) and our experimental results with their official implementation on our devices.


trust judgments from the student model because 
the student network is updated earlier and more 
frequently than the teacher network, and therefore 
better able to capture the changes of pseudo labels. 
N is the size of the training corpus. 
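A minimal sketch of the masked objective in Eq. 5 follows, assuming one-hot pseudo labels so that each term reduces to the negative log-probability of the pseudo-label class. The function name `masked_cross_entropy`, the `transferred` argument, and the toy probabilities are hypothetical; probabilities are assumed to be strictly positive.

```python
import math


def masked_cross_entropy(probs, pseudo_labels, mask, transferred=frozenset()):
    """Average masked cross entropy over N tokens.

    probs[i][c]: student probability of class c for token i.
    mask[i]: 1 if the teacher's pseudo label passed the selection rule, else 0.
    transferred: indices whose pseudo label came from the other student;
                 these are always unmasked.
    """
    n = len(pseudo_labels)
    total = 0.0
    for i, (p, y) in enumerate(zip(probs, pseudo_labels)):
        m = 1 if i in transferred else mask[i]
        total -= m * math.log(p[y])
    return total / n


probs = [[0.5, 0.5], [0.9, 0.1]]
pseudo_labels = [0, 0]
mask = [1, 0]

loss_masked = masked_cross_entropy(probs, pseudo_labels, mask)
loss_with_transfer = masked_cross_entropy(probs, pseudo_labels, mask, transferred={1})
print(loss_masked, loss_with_transfer)
```

In the second call, token 1 contributes to the loss even though its mask is 0, mirroring the rule that transferred pseudo labels are automatically unmasked.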

Different from the optimization of the student
network, we follow Zhang et al. (2021a) and apply EMA to
gradually update the parameters of the teacher:


$$W_t \leftarrow \alpha W_t + (1 - \alpha) W_s \quad (6)$$


where $\alpha$ denotes the smoothing coefficient. With
the conservative and ensemble properties, the us- 
age of EMA has largely mitigated the bias. As a 
result, the teacher tends to generate more reliable 
pseudo labels, which can be used as new supervi- 
sion signals in the denoising self-training stage. 
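The update in Eq. 6 can be sketched on a dictionary of scalar parameters. This is a simplified illustration: the function name `ema_update`, the parameter names, and the value of the smoothing coefficient are assumptions, and a real implementation would apply the same rule to every weight tensor.

```python
def ema_update(teacher, student, alpha=0.99):
    """Return new teacher parameters: alpha * teacher + (1 - alpha) * student."""
    return {
        name: alpha * teacher[name] + (1.0 - alpha) * student[name]
        for name in teacher
    }


teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 2.0}

new_teacher = ema_update(teacher, student, alpha=0.9)
print(new_teacher)
```

A coefficient close to 1 keeps the teacher near its previous state, which is the conservative behavior the text refers to: a single noisy student update moves the teacher only slightly.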


Inference Stage In the inference stage, only the
best model $W_{best} \in \{W_{t_1}, W_{s_1}, W_{t_2}, W_{s_2}\}$ on the
dev set is adopted for predicting the test data.


5 Experiment 


5.1 Dataset 


We conduct experiments on five DS-NER datasets, 
including CoNLL03 (Tjong Kim Sang and
De Meulder, 2003), Webpage (Ratinov and Roth,
2009), Wikigold (Balasuriya et al., 2009), Twitter
(Godin et al., 2015) and OntoNotes5.0 (Weischedel
et al., 2013). For a fair comparison, we follow
the same knowledge bases and settings as Liang 
et al. (2020), re-annotate the training set by distant 
supervision, and use the original dev and test set. 
Statistics of datasets are shown in Appendix A.1. 


5.2 Evaluation Metrics and Baselines 


We use Precision (P), Recall (R), and F1 score 
as our evaluation metrics. We compare CENSOR 
with various baseline methods, including super- 
vised methods and DS-NER methods. We also 
present the results of KB-Matching, which directly 
uses knowledge bases to annotate the test sets. 
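For reference, the F1 score is the harmonic mean of precision and recall. The short sketch below uses a hypothetical helper `f1_score` and checks it against the CENSOR precision and recall reported on CoNLL03 in Table 1.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# CENSOR on CoNLL03 (Table 1): P = 87.33, R = 85.90.
censor_f1 = round(f1_score(87.33, 85.90), 2)
print(censor_f1)  # 86.61, matching the reported F1
```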


Supervised Methods We select BiLSTM-CRF
(Ma and Hovy, 2016), RoBERTa (Liu et al., 2019)
and DistilRoBERTa (Sanh et al., 2019) as original
supervised methods. Because they are trained on noisy DS-NER
datasets, these methods achieve poor performance.


DS-NER Methods We compare several DS-NER 
baselines. AutoNER (Shang et al., 2018) modifies 
the standard CRF to get better performance under 
the noise. LRNT (Cao et al., 2019) does not fully explore
the training data, in order to reduce the negative effect
of noisy labels. Co-teaching+ (Yu et al., 2019)
and JoCoR (Wei et al., 2020) are two classical collaborative
learning methods to handle noisy labels
in the computer vision area. NegSampling (Li et al.,
2021) uses down-sampling of non-entities to relieve
the misguidance from incomplete annotation.


Teacher-Student Methods for DS-NER Specifically,
BOND (Liang et al., 2020) designs a teacher-student
network and selects high-confidence predictions
as pseudo labels to get a robust model. SCDL
(Zhang et al., 2021b) improves the performance 
by training two teacher-student networks and se- 
lecting consistent high-confidence predictions be- 
tween two teachers as pseudo labels. ATSEN (Qu 


| Method | P | R | F1 |
|---|---|---|---|
| CENSOR | 87.33 | 85.90 | 86.61 |
| w/o UTL | 86.56 (-0.77) | 84.37 (-1.53) | 85.45 (-1.16) |
| w/o SCL | 86.44 (-0.89) | 83.98 (-1.92) | 85.19 (-1.42) |


Table 2: Ablation study on CoNLL03. UTL means
Uncertainty-Aware Teacher Learning and SCL means
Student-Student Collaborative Learning.


et al., 2023) considers both consistent and inconsis- 
tent predictions with high confidence between two 
teachers and further proposes a fine-grained teacher 
updating method. We report the results of ATSEN
with the official implementation on our devices.


5.3 Experimental Settings


Following Qu et al. (2023), we adopt RoBERTa-base
and DistilRoBERTa-base as two NER models
for two teacher-student networks. We use Adam
(Kingma and Ba, 2015) as our optimizer. We list
detailed hyperparameters in Appendix A.2.


5.4 Main Results 


Table 1 presents the performance of different meth- 
ods measured by precision, recall, and F1 score. 
Specifically, (1) CENSOR achieves new SOTA 
performance, showing superiority in the DS-NER 
task; (2) Compared to the original supervised methods,
including BiLSTM-CRF, DistilRoBERTa, and RoBERTa,
CENSOR improves the F1 score with
an average increase of 23.04%, 10.96%, and 8.99%,
respectively, which demonstrates the necessity of
DS-NER models and the effectiveness of our method; (3) Classical
de-noising methods from the computer vision
area (e.g., Co-teaching+) cannot achieve strong performance
when applied directly,
since these methods were not initially designed
for sequence labeling tasks and ignore the characteristics
of the DS-NER task; (4) Compared with
teacher-student methods such as BOND, SCDL, 
and ATSEN, CENSOR achieves advanced perfor- 
mance, confirming that these teacher-student meth- 
ods achieve limited performance because of the 
incorrect pseudo-labeled samples. 


5.5 Analysis 


Ablation Study As shown in Table 2,
Uncertainty-Aware Teacher Learning and Student-Student
Collaborative Learning are both important
to the model performance. Removing either component
decreases both precision and recall,
showing that the proposed components indeed improve performance.


Figure 3: F1 on CoNLL03 with different noise ratios.


| Method | P | R | F1 |
|---|---|---|---|
| BOND | 80.87 (-13.49) | 78.04 (-7.09) | 79.43 (-10.08) |
| SCDL | 94.18 (-0.18) | 77.11 (-8.02) | 84.80 (-4.71) |
| ATSEN | 93.01 (-1.35) | 82.96 (-2.17) | 87.70 (-1.87) |
| CENSOR | 94.36 | 85.13 | 89.51 |


Table 3: Comparison of the effectiveness of reducing
label noise on CoNLL03.


Robustness to Different Noise Ratios To investigate
the robustness of CENSOR under different
noise ratios, we randomly replace k% of the entity labels
in the clean version (instead of the distantly-supervised
version) of the CoNLL03 training set with
other entity types or non-entity. In this way, we can
construct training sets with different ratios of label noise, and
we report the test F1 score on CoNLL03.
As shown in Figure 3, CENSOR achieves consistently
advanced performance across noise ratios,
showing its satisfactory de-noising ability and
strong robustness. Moreover, when the noise ratio
is above 50%, the advantage of CENSOR becomes more significant,
since CENSOR can select and generate
more reliable labels from highly noisy data through Uncertainty-Aware
Teacher Learning and Student-Student Collaborative
Learning. More detailed
data can be found in Table 9 in the Appendix.
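The noise construction described above can be sketched at the token level. This is a simplification under stated assumptions: the function name `inject_label_noise`, the flat label set without BIO prefixes, and the seed are hypothetical, and the actual experiments operate on entity labels of the clean CoNLL03 training set.

```python
import random


def inject_label_noise(labels, label_set, noise_ratio, seed=0):
    """Replace a fraction of entity labels with a different type or non-entity.

    Only positions whose clean label is an entity (not "O") can be corrupted.
    """
    rng = random.Random(seed)
    entity_positions = [i for i, y in enumerate(labels) if y != "O"]
    n_noisy = round(len(entity_positions) * noise_ratio)
    noisy = list(labels)
    for i in rng.sample(entity_positions, n_noisy):
        # Pick any label other than the clean one, including "O".
        noisy[i] = rng.choice([y for y in label_set if y != labels[i]])
    return noisy


clean = ["PER", "O", "LOC", "ORG", "O", "PER"]
label_set = ["PER", "LOC", "ORG", "O"]
noisy = inject_label_noise(clean, label_set, noise_ratio=0.5)
print(noisy)
```

Replacing an entity label with another type simulates inaccurate annotation, while replacing it with "O" simulates incomplete annotation, so a single ratio controls both kinds of noise.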


Effectiveness of Reducing Learned Noise To
confirm that previous teacher-student methods achieve
limited performance because of incorrectly pseudo-labeled
samples, we explore the effectiveness
of reducing label noise for different teacher-student
methods, including CENSOR, BOND,
SCDL, and ATSEN. Specifically, we report the average
F1 score of all selected (unmasked) pseudo labels
for training during the self-training stage, using the
labels from the clean version of the CoNLL03 training
set as ground truth labels. As shown in Table 3,
CENSOR achieves the highest F1 score,
which indicates CENSOR can select more correct
labels based on Uncertainty-Aware Label Selection
and Student-Student Collaborative Learning. Thus,
CENSOR can use more correct pseudo labels to
update the parameters of student networks and further
avoid error propagation, leading to outstanding
overall performance on the test set.


| Method | P | R | F1 |
|---|---|---|---|
| BOND | 80.42 (-9.44) | 76.46 (-8.69) | 78.39 (-9.05) |
| SCDL | 87.42 (-2.44) | 75.85 (-9.30) | 81.22 (-6.22) |
| ATSEN | 87.84 (-2.02) | 82.83 (-2.32) | 85.26 (-2.18) |
| CENSOR | 89.86 | 85.15 | 87.44 |


Table 4: Comparison of teacher pseudo-labeling ability
of different teacher-student methods on CoNLL03.

Figure 4: F1 on CoNLL03 with different threshold $\sigma_{ua}$
in Uncertainty-Aware Label Selection.


Effectiveness of Teacher Pseudo-labeling Af- 
ter confirming the effectiveness of reducing label 
noise, we attempt to further explore whether the 
teacher network could use more reliable labels to 
avoid error propagation, thus generating more correct
pseudo labels. As shown in Table 4, we report
the best F1 score of teacher networks from different
teacher-student methods on the clean version
of the CoNLL03 training set. In detail, the teacher
network from CENSOR correctly labels 87.44% of
samples, achieving the highest precision,
recall, and F1 score. Compared to other teacher-
student methods, including BOND, SCDL, and 
ATSEN, CENSOR improves the F1 score with an 
average increase of 9.05%, 6.22%, and 2.18%, re- 
spectively, which demonstrates using more correct 
labels can avoid error propagation and make the 
teacher network generate more reliable labels. In 
this way, the teacher network can make full use 
of the noisy samples in the DS-NER training set 
and help the teacher-student framework achieve 
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Figure 5: Fl on CoNLLO3 with different ratio 6 of se- 
lected labels in Student-student Collaborative Learning. 


outstanding performance on the test set. 
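As a quick consistency check on Table 4, the parenthesized values can be read as gaps relative to CENSOR (an interpretation assumed here, since the table does not state it). Under that reading, adding each baseline's score to its gap should give the same CENSOR score in every row. The sketch below uses `decimal.Decimal` to avoid floating-point rounding; the dictionary layout is an illustrative choice.

```python
from decimal import Decimal

# (score, gap) pairs for P, R, and F1 as printed in Table 4.
rows = {
    "BOND": (("80.42", "9.44"), ("76.46", "8.69"), ("78.39", "9.05")),
    "SCDL": (("87.42", "2.44"), ("75.85", "9.30"), ("81.22", "6.22")),
    "ATSEN": (("87.84", "2.02"), ("82.83", "2.32"), ("85.26", "2.18")),
}


def implied_censor_scores(rows):
    """Add each baseline score to its reported gap, per metric."""
    return {
        method: tuple(Decimal(score) + Decimal(gap) for score, gap in cells)
        for method, cells in rows.items()
    }


implied = implied_censor_scores(rows)
for method, scores in implied.items():
    print(method, [str(s) for s in scores])
```

All three baselines imply the same CENSOR row (precision 89.86, recall 85.15, F1 87.44), which matches the F1 gaps of 9.05, 6.22, and 2.18 quoted in the text.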


Parameter Study As shown in Figure 4 and Figure 5,
we conduct experiments to explore the impact of
important hyperparameters and to further understand
Uncertainty-Aware Label Selection and Student-Student
Collaborative Learning. Overall, although the choice
of hyperparameters has some impact on model
performance, as long as the hyperparameters are
chosen sensibly rather than at extreme values (e.g.,
wrongly setting the threshold σ_ua in
Uncertainty-Aware Label Selection to 0), the model
consistently performs better than it does without
the components. More detailed analyses are given in
Appendix A.5.
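The role of the threshold can be illustrated with a small selection rule. This is a sketch under stated assumptions, not the paper's Eq. 5: it assumes a token's pseudo label is kept when its confidence exceeds a confidence threshold and its uncertainty falls below the uncertainty threshold. The function name `select_labels` and the toy numbers are hypothetical; the defaults 0.9 and 0.01 are the CoNLL03 values listed in Table 8.

```python
def select_labels(confidence, uncertainty, sigma_co=0.9, sigma_ua=0.01):
    """Return one boolean per token: True if its pseudo label is kept.

    Assumed rule: keep a label only if the teacher is confident
    (confidence > sigma_co) and stable (uncertainty < sigma_ua).
    """
    return [c > sigma_co and u < sigma_ua for c, u in zip(confidence, uncertainty)]


confidence = [0.99, 0.95, 0.97, 0.60]
uncertainty = [0.002, 0.008, 0.050, 0.001]

default_mask = select_labels(confidence, uncertainty)
zero_mask = select_labels(confidence, uncertainty, sigma_ua=0.0)
huge_mask = select_labels(confidence, uncertainty, sigma_ua=1000.0)
print(default_mask, zero_mask, huge_mask)
```

With the threshold at 0 no uncertainty can fall below it, so every label is masked and the student has nothing to learn from. With a very large threshold the uncertainty test always passes and only the confidence test remains, so the uncertainty filter no longer contributes.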


Case Study We also conduct a case study to
understand the advantage of CENSOR with two
examples in Table 5 and Table 6. We show the
predictions of BOND, SCDL, ATSEN, and CENSOR
on a training sequence with label noise and a test
sequence with ground truth. As shown in Table 5,
BOND and SCDL can generalize slightly to unseen
mentions and partially relieve incomplete
annotation, e.g., they successfully recognize
"John McNamara" and "New York". However, these
methods still suffer from label noise. In
comparison, for the hard label "California Angels",
CENSOR and ATSEN are able to detect it thanks to
their advanced teacher-student designs (e.g.,
Adaptive Teacher Learning in ATSEN and
Student-Student Collaborative Learning in CENSOR)
instead of relying purely on distant labels.
However, as shown in Table 6, ATSEN still struggles
to distinguish between easily confused samples and
achieves inadequate generalization. In contrast, as
CENSOR uses fewer incorrectly pseudo-labeled
samples thanks to
Uncertainty-Aware Teacher Learning and Student-


Distant Match: [Johnson]PER is the second manager to be hospitalized after California [Angels]PER
skipper [John]PER McNamara was admitted to New [York]PER 's [Columbia]PER Presby Hospital .
Ground Truth: [Johnson]PER is the second manager to be hospitalized after [California Angels]ORG
skipper [John McNamara]PER was admitted to [New York]LOC 's [Columbia Presby Hospital]ORG .


BOND: [Johnson]PER is the second manager to be hospitalized after [California]LOC [Angels]PER
skipper [John McNamara]PER was admitted to [New York]LOC 's [Columbia]PER Presby Hospital .
SCDL: [Johnson]PER is the second manager to be hospitalized after [California]LOC [Angels]PER
skipper [John McNamara]PER was admitted to [New York]LOC 's [Columbia Presby Hospital]ORG .
ATSEN: [Johnson]PER is the second manager to be hospitalized after [California Angels]ORG
skipper [John McNamara]PER was admitted to [New York]LOC 's [Columbia Presby Hospital]ORG .


CENSOR: [Johnson]PER is the second manager to be hospitalized after [California Angels]ORG
skipper [John McNamara]PER was admitted to [New York]LOC 's [Columbia Presby Hospital]ORG .


Table 5: Case study with CENSOR and previous teacher-student methods for DS-NER. The sentence is from the
CoNLL03 training set.


Ground Truth: All-conquering [Juventus]ORG field their most recent signing, [Portuguese]MISC defender [Dimas]PER,
while [Alessandro Del Piero]PER and [Croat]MISC [Alen Boksic]PER lead the attack.


BOND: All-conquering [Juventus]ORG field their most recent signing, [Portuguese]ORG defender [Dimas]PER,
while [Alessandro Del Piero]PER and [Croat Alen Boksic]PER lead the attack.

SCDL: All-conquering [Juventus]ORG field their most recent signing, [Portuguese]MISC defender [Dimas]PER,
while [Alessandro Del Piero]PER and [Croat Alen Boksic]PER lead the attack.

ATSEN: All-conquering [Juventus]ORG field their most recent signing, [Portuguese]MISC defender [Dimas]PER,
while [Alessandro Del Piero]PER and [Croat]ORG [Alen Boksic]PER lead the attack.


CENSOR: All-conquering [Juventus]ORG field their most recent signing, [Portuguese]MISC defender [Dimas]PER,
while [Alessandro Del Piero]PER and [Croat]MISC [Alen Boksic]PER lead the attack.


Table 6: Case study with CENSOR and previous teacher-student methods for DS-NER. The sentence is from the
CoNLL03 test set.


Student Collaborative Learning, a higher degree of 
robustness and generalization can be achieved. 


6 Conclusion 


We introduce CENSOR, a novel teacher-student
framework designed for the DS-NER task. CENSOR
incorporates Uncertainty-Aware Teacher Learning,
which uses prediction uncertainty to guide
pseudo-label selection. It reduces the use of
incorrect pseudo labels by avoiding reliance on
confidence scores from poorly calibrated teacher
networks. We also introduce Student-Student
Collaborative Learning, which keeps a student
network from relying completely on the pseudo
labels from its teacher network, minimizing the
risk of learning incorrect ones. Meanwhile, this
component allows the training set to be fully
explored. Our experimental results demonstrate
CENSOR's superior performance compared to previous
methods.


Limitations 


Our proposed CENSOR has two minor limitations:
(1) CENSOR focuses on addressing label noise in
the DS-NER task, and all our analyses are specific
to this task. As a result, our model may be less
robust than other models when applied to tasks
other than DS-NER. (2) Because of the proposed
Uncertainty-Aware Teacher Learning, our model
performs multiple forward passes in the uncertainty
estimation phase, which increases the self-training
time. The self-training of our model takes about
4 times as long as that of ATSEN.


A Appendix
A.1  DS-NER Datasets 


Statistics of five datasets are shown in Table 7. 


Dataset                  Train     Dev      Test     Types

CoNLL03       Sentence   14041     3250     3453     4
              Token      203621    51362    46435
OntoNotes5.0  Sentence   115812    15680    12217    18
              Token      2200865   304701   230118
Webpage       Sentence   385       99       135      4
              Token      5293      1121     1131
Wikigold      Sentence   1142      280      274      4
              Token      25819     6650     6538
Twitter       Sentence   2393      999      3844     10
              Token      44076     15262    58064


Table 7: The statistics of five DS-NER datasets.


A.2 Hyperparameters

Detailed hyperparameters are shown in Table 8. 
Experiments are run on a single NVIDIA A40. 
A.3 Pseudocode 


Algorithm 1 gives the pseudocode of our method. 


A.4 Robustness to Different Noise Ratios 
Detailed data in Figure 3 can be found in Table 9. 


A.5 Parameter Study 


In Figure 4 and Table 10, we analyze the impact of
σ_ua in Eq.3 within Uncertainty-Aware Label
Selection. Notably, for minimal values of σ_ua, such
as 0 and 0.001, the Uncertainty-Aware Label Selection
phase filters and masks all samples. Consequently,


Algorithm 1 Training Procedure of CENSOR.


Input: DS-NER dataset D_ds = {(X_i, Y_i)}.

Parameter: Two teacher-student network parameters, including W_{t1}, W_{s1},
W_{t2}, and W_{s2}.

Output: The best model.


1: Pre-train two models W_A, W_B with D_ds.                      ▷ Pre-Training.
2: Initialize two teacher-student networks: W_{t1} ← W_A, W_{s1} ← W_A,
   W_{t2} ← W_B, W_{s2} ← W_B.
3: Initialize training step: step ← 0.
4: Initialize noisy labels: Y_I ← Y, Y_II ← Y.
5: while not reach max training epochs do                        ▷ Self-Training.
6:    Get a batch (X^(b), Y_I^(b), Y_II^(b)) from D_ds,
      step ← step + 1.
7:    Get pseudo labels via the teachers W_{t1}, W_{t2}:
      Ỹ_I^(b) ← f(X^(b); W_{t1}),
      Ỹ_II^(b) ← f(X^(b); W_{t2}).
8:    Select reliable labels via Uncertainty-Aware Teacher Learning:
      Estimate Confidence and Uncertainty by Eq.3 and Eq.4, separately.
      T_I^(b) ← Uncertainty-Aware Label Selection(Y_I^(b), Ỹ_I^(b)),
      T_II^(b) ← Uncertainty-Aware Label Selection(Y_II^(b), Ỹ_II^(b)).
9:    Select reliable labels via Student-Student Collaborative Learning:
      D_{s1} = arg min_{D': |D'| ≥ δ%|D|} Loss(s1, D'),   // sample δ% small-loss instances
      D_{s2} = arg min_{D': |D'| ≥ δ%|D|} Loss(s2, D').   // sample δ% small-loss instances
      Transfer the pseudo labels between D_{s1} and D_{s2}.
10:   Update the students W_{s1} and W_{s2} by Eq. 7.
11:   Update the teachers W_{t1} and W_{t2} by Eq. 8.
12: end while
13: Evaluate models W_{t1}, W_{s1}, W_{t2}, W_{s2} on Dev set.
14: return The best model W ∈ {W_{t1}, W_{s1}, W_{t2}, W_{s2}}
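Step 11 of Algorithm 1 updates each teacher from its student. Eq. 8 is not reproduced in this appendix, so the sketch below assumes the common exponential moving average (EMA) form, consistent with the "EMA α" entry in the hyperparameter table. The function name `ema_update` and the dictionary-of-lists weight format are illustrative assumptions, not the paper's implementation.

```python
def ema_update(teacher, student, alpha=0.995):
    """Return new teacher weights as an exponential moving average.

    Assumed update: W_t <- alpha * W_t + (1 - alpha) * W_s, applied
    element-wise to every named parameter.
    """
    return {
        name: [alpha * t + (1.0 - alpha) * s
               for t, s in zip(teacher[name], student[name])]
        for name in teacher
    }


teacher = {"w": [1.0, 1.0]}
student = {"w": [0.0, 2.0]}

# alpha = 0.5 is used here only to make the arithmetic easy to follow.
updated = ema_update(teacher, student, alpha=0.5)
print(updated)
```

With α close to 1, as in the 0.99 to 0.995 range listed for the five datasets, the teacher moves slowly and smooths over noisy student updates.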


the student network becomes incapable of parameter
updates, rendering the entire teacher-student
framework non-trainable. When σ_ua lies in a
reasonable interval, the effectiveness of the model
is always improved due to the inclusion of filtered
reliable labels in the self-training stage.
Ultimately, when σ_ua becomes excessively large, the
filtering capacity of the Uncertainty-Aware Label
Selection stage is nullified, making the outcome
equivalent to omitting Uncertainty-Aware Teacher
Learning. Therefore, while most values of σ_ua tend
to improve performance, choosing σ_ua sensibly is
crucial for optimizing Uncertainty-Aware Teacher
Learning. In Figure 5 and Table 11, we also explore
the impact of the ratio δ of selected labels in
Student-Student Collaborative Learning. A small δ
enables the student network to partially leverage
reliable labels from its counterpart, resulting in
improved outcomes compared to scenarios without such
collaborative learning. As δ increases, the transfer
of these reliable labels diminishes the likelihood
of learning incorrect labels from teacher-generated
pseudo labels, thereby enhancing overall
performance. Conversely, an excessively large δ
adversely affects performance. This is because, with
a high transfer proportion (e.g., δ = 0.8), the
pseudo labels of the selected samples cease to
qualify as small-


Name                        CoNLL03   OntoNotes5.0   Webpage   Wikigold   Twitter

Learning Rate               1e-5      2e-5           1e-5      1e-5       2e-5
Batch Size                  8         16             16        16         8
EMA α                       0.995     0.995          0.99      0.99       0.995
Sche. Warmup                200       500            100       200        200
Total Epoch                 50        50             50        50         50
Pre-training Epoch          1         2              12        ?          6
σ_co in Eq.5 of UTL         0.9       0.9            0.9       0.9        0.9
σ_ua in Eq.5 of UTL         0.01      0.05           0.1       0.2        0.2
K in Eq.2 of UTL            8         8              8         8          8
Dropout Rate                0.5       0.5            0.5       0.5        0.5
ratio δ of SCL              0.3       0.4            0.3       0.1        0.1
Update Cycle (iterations)   6000      7240           300       2000       3200

Table 8: Hyperparameters on five DS-NER datasets.
UTL means Uncertainty-Aware Teacher Learning and
SCL means Student-Student Collaborative Learning.


Ratio   ATSEN   SCDL    BOND    Ours
10%     90.19   90.15   87.63   90.38
20%     90.03   89.85   88.03   90.22
30%     89.79   89.48   86.80   89.88
40%     88.97   88.49   84.42   89.11
50%     84.77   83.66   82.56   86.27
60%     82.55   82.64   80.94   84.96
70%     75.75   76.88   77.38   80.66
80%     56.61   55.26   50.49   59.80
90%     19.59   17.09   14.85   22.26


Table 9: F1 on CoNLL03 with different noise ratios.


loss samples and are more prone to containing
noise. Hence, selecting an appropriate ratio δ is
critical for optimizing the efficacy of
Student-Student Collaborative Learning.
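The effect of the ratio δ can be made concrete with a small-loss selection sketch. The details here are assumptions for illustration: samples are ranked by the peer student's loss, the δ fraction with the smallest loss is treated as reliable, and the peer's pseudo labels on those samples overwrite the student's own. The function names `small_loss_indices` and `transfer_labels` and the toy labels are hypothetical, and the actual transfer rule in CENSOR may differ.

```python
def small_loss_indices(losses, delta):
    """Indices of the delta fraction of samples with the smallest loss."""
    k = int(round(delta * len(losses)))
    ranked = sorted(range(len(losses)), key=lambda i: losses[i])
    return ranked[:k]


def transfer_labels(own_labels, peer_labels, peer_losses, delta):
    """Replace own pseudo labels with the peer's on the peer's small-loss samples."""
    merged = list(own_labels)  # copy so the caller's list is not modified
    for i in small_loss_indices(peer_losses, delta):
        merged[i] = peer_labels[i]
    return merged


own = ["O", "PER", "O", "LOC", "O"]
peer = ["ORG", "PER", "O", "O", "MISC"]
peer_losses = [0.05, 0.90, 0.30, 0.02, 1.20]

merged = transfer_labels(own, peer, peer_losses, delta=0.4)
print(merged)
```

With δ = 0.4 only the two lowest-loss peer labels are transferred. Raising δ toward 1.0 copies labels whose loss is no longer small, which is the regime the analysis above identifies as harmful.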


A.6 Difference between Previous Methods 


We carefully compare CENSOR with previous methods
to explain our motivation and the differences
between previous methods and our proposed components.


Uncertainty-Aware Teacher Learning Most research
on uncertainty estimation focuses on computer
vision because it provides visual validation of
uncertainty quality. For example, Rizve et al.
(2021) first introduce uncertainty to filter
low-quality labels in the semi-supervised image
classification task. However, very little research
on uncertainty has been presented in the natural
language processing domain. As far as we know, we
are the first to introduce uncertainty into the
DS-NER task. Meanwhile, different from the
instance-level image classification task, the
DS-NER task is
based on token-level classification, which requires


σ_ua       P       R       F1

-w/o UTL   86.56   84.37   85.45
0.000      00.00   00.00   00.00
0.001      00.00   00.00   00.00
0.005      85.65   82.68   84.14
0.010      87.33   85.90   86.61
0.500      ?       ?       ?
0.800      87.60   85.06   86.32
1.000      87.27   85.56   86.41
10.00      87.27   85.56   86.41
100.0      86.56   84.37   85.45
1,000      86.56   84.37   85.45


Table 10: F1 on CoNLL03 with different threshold
σ_ua in Uncertainty-Aware Label Selection. UTL means
Uncertainty-Aware Teacher Learning.


δ          P       R       F1


-w/o SCL   86.44   83.98   85.19
0.1        86.81   84.92   85.85
0.2        87.35   84.33   85.82
0.3        87.33   85.90   86.61
0.4        86.95   84.58   85.75
0.5        86.28   84.41   85.33
0.8        86.27   84.01   85.13
1.0        85.70   83.68   84.68


Table 11: F1 on CoNLL03 with different ratio δ of
selected labels in Student-Student Collaborative Learning.
SCL means Student-Student Collaborative Learning.


the model to capture the inherent token-wise label
dependency. Therefore, instead of estimating
uncertainty at the instance level, we analyze the
unique characteristics of the DS-NER task in the
paper and design Uncertainty-Aware Teacher Learning
to measure uncertainty at the token level. In
addition, we are the first to find that previous
teacher-student methods achieve limited performance
because poor network calibration produces
incorrectly pseudo-labeled samples in the DS-NER
task. Thus, we use uncertainty as the indicator to
reduce the effect of incorrect pseudo
labels within the teacher-student framework.
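Token-level uncertainty estimation from several stochastic forward passes can be sketched as follows. This is an illustration under assumptions, not the paper's Eq. 2 to Eq. 4: it takes class probabilities from K passes (Table 8 lists K = 8 and a dropout rate of 0.5), uses the mean probability of the predicted class as confidence, and uses the standard deviation of that probability across passes as uncertainty. The function name and the toy probabilities are hypothetical.

```python
from statistics import fmean, pstdev


def token_confidence_and_uncertainty(passes):
    """Per-token label, confidence, and uncertainty from K stochastic passes.

    passes[k][t][c] is the probability of class c for token t in pass k.
    """
    num_tokens = len(passes[0])
    num_classes = len(passes[0][0])
    labels, confidence, uncertainty = [], [], []
    for t in range(num_tokens):
        # Average the class distribution of token t over all passes.
        mean_probs = [fmean([p[t][c] for p in passes]) for c in range(num_classes)]
        label = max(range(num_classes), key=lambda c: mean_probs[c])
        labels.append(label)
        confidence.append(mean_probs[label])
        # Spread of the predicted class probability across passes.
        uncertainty.append(pstdev([p[t][label] for p in passes]))
    return labels, confidence, uncertainty


# Two passes, two tokens, two classes: token 0 is stable, token 1 flips.
passes = [
    [[0.9, 0.1], [0.6, 0.4]],
    [[0.9, 0.1], [0.2, 0.8]],
]
labels, confidence, uncertainty = token_confidence_and_uncertainty(passes)
print(labels, confidence, uncertainty)
```

Each token gets its own uncertainty value, so a single unstable token can be masked while the rest of the sentence is still used, which an instance-level estimate cannot do.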


Student-Student Collaborative Learning
Collaborative Learning (Han et al., 2018; Yu et al.,
2019; Wei et al., 2020) is a popular method for
handling label noise, which uses two different
networks to provide multi-view knowledge and lets
them learn from each other. Co-teaching (Han et al.,
2018) first attempts to completely exchange the
reliable samples of two different networks and then
updates the networks with the exchanged multi-view
information. Co-teaching+ (Yu et al., 2019) further
proposes a disagreement strategy to update the two
networks, i.e., only using the data on which the
predictions of the two networks disagree. JoCoR
(Wei et al., 2020) uses a designed joint loss to
reduce the diversity of the two networks during
training and further improve their robustness.
However, these methods are designed for tasks in
the computer vision area (especially image
classification), and as shown in Table 1, they
often achieve limited performance in the DS-NER
task. SCDL designs the teacher-student framework
and adopts collaborative learning in the DS-NER
task. Similar to Co-teaching, all of the pseudo
labels predicted by the teacher are applied to
update the noisy labels of the peer teacher-student
network periodically, since the two teacher-student
networks have different learning abilities based on
different network structures. Different from SCDL,
we utilize two different student networks and let
them learn from each other to reduce the negative
effect of incorrect pseudo labels. Specifically,
instead of completely exchanging pseudo labels
between two teachers, we allow students to transfer
reliable pseudo labels and at the same time allow
students to learn from the pseudo labels generated
by their own teacher network. In this way, we
ensure not only that the transferred pseudo labels
contain multi-view information but also that they
are of high quality, thanks to selective transfer.
Meanwhile, as the student network is updated
earlier and more frequently than the teacher
network, the student network is better able to
capture changes in the pseudo labels than the
teacher network.


Relation between Two Components The designs of
Uncertainty-Aware Teacher Learning and
Student-Student Collaborative Learning are not
independent. The two components can collaborate and
achieve better results. Specifically,
(1) Uncertainty-Aware Teacher Learning can help the
teacher network generate more reliable pseudo
labels and further reduce the risk of the student
network updating its parameters on incorrect pseudo
labels. At the same time, a more effective student
network can be obtained by learning from pseudo
labels with fewer errors, which further improves
the effectiveness of the Student-Student
Collaborative Learning component; (2) based on
Uncertainty-Aware Teacher Learning, the teacher
network can utilize the correctly pseudo-labeled
samples to alleviate the negative effect of label
noise. However, simply masking unreliable
pseudo-labeled samples can lead to underutilization
of the training set, as there is no chance for the
incorrectly pseudo-labeled samples to be corrected
and learned from. Student-Student Collaborative
Learning allows the student network to learn from
reliable labels transferred from the other student
network. Therefore, this component further enables
a full exploration of mislabeled samples rather
than simply filtering unreliable pseudo-labeled
samples. Through the collaboration of the two
components, as shown in Table 1, CENSOR achieves
the best performance among 12 baselines.


